feat: add OLMES variant of BigCodeBench by tfburns · Pull Request #184 · Aleph-Alpha-Research/eval-framework

tfburns · 2026-02-26T09:16:16Z

PR Checklist

Use descriptive commit messages.
Provide tests for your changes.
Update any related documentation and include any relevant screenshots.
Check if changes need to be made to docs (README or any guides in /docs/).

What type of PR is this? (check all applicable)

Description

Adds a variant of the BigCodeBench task which mimics the OLMES implementation.

Added/updated tests?

Yes
No, and this is why: please replace this line with details on why tests
have not been included
I need help with writing tests

prabhuteja12 · 2026-02-26T15:14:29Z

src/eval_framework/tasks/benchmarks/bigcodebench.py

-        assert num_fewshot == 0, "Fewshot is not supported for BigCodeBench"
+        # Only the base BigCodeBench class disallows fewshot; subclasses (e.g. BigCodeBench_OLMES) may use it.
+        if self.__class__ is BigCodeBench and num_fewshot != 0:
+            raise ValueError("Fewshot is not supported for BigCodeBench; use BigCodeBench_OLMES for 3-shot.")


Why should this be an error?

I changed the raise ValueError to a logger.warning that logs the requested value and resets num_fewshot to 0, which matches how our existing BigCodeBench task already runs. I kept the check here to avoid user confusion since, by contrast, BigCodeBench_OLMES only runs with num_fewshot=0.
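
For illustration, a minimal sketch of the warn-and-reset behaviour described above. The helper name `resolve_num_fewshot` and the message wording are assumptions made for this example, not the code that landed in the PR.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_num_fewshot(requested: int) -> int:
    """Hypothetical helper: return the few-shot count the base task will use.

    Instead of raising, log the value the user asked for and fall back to 0.
    """
    if requested != 0:
        logger.warning(
            "Fewshot is not supported for BigCodeBench; got num_fewshot=%d, resetting to 0.",
            requested,
        )
        return 0
    return requested
```

With this shape a misconfigured run still proceeds, and the log keeps a record of the value that was requested.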

prabhuteja12 · 2026-02-26T15:15:50Z

src/eval_framework/tasks/benchmarks/bigcodebench.py

+    def _get_fewshot_target_text(self, item: dict[str, Any]) -> str:
+        # Match oe_eval doc_to_target for complete: canonical_solution + "\\n```"
+        target = item["canonical_solution"]
+        assert target is not None and isinstance(target, str)


Ideally, raise a ValueError as asserts can be turned off globally

Replaced this with an explicit if not isinstance(target, str): raise ValueError(...).
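
A rough sketch of what that explicit check could look like, written as a standalone function rather than the method in the diff. The error message is an assumption, and the trailing code-fence suffix mentioned in the diff comment is left out to keep the example focused on validation.

```python
from typing import Any


def get_fewshot_target_text(item: dict[str, Any]) -> str:
    """Standalone sketch of the validation; not the exact method from the PR."""
    target = item.get("canonical_solution")
    # Unlike an assert, this check still runs when Python is started with -O.
    if not isinstance(target, str):
        raise ValueError(
            f"Expected 'canonical_solution' to be a str, got {type(target).__name__}"
        )
    return target
```

The isinstance check also covers the None case, so the separate `is not None` test from the original assert is no longer needed.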

tests/tests_eval_framework/tasks/task-prompts-hashes.json

prabhuteja12 · 2026-02-26T15:18:06Z

tests/tests_eval_framework/tasks/test_utils.py


        test_code = r"""
 import unittest
 class TestCases(unittest.TestCase):


Would you be able to rename these and add some description of what they are actually testing? I'm not sure why these tests use unittest while the rest of the repo uses pytest.

Renamed these test methods to be a bit more descriptive and added docstrings explaining that the unittest code in the test data strings reflects BigCodeBench's format, not our repo's test framework.

Those strings get passed to execute_python_code_with_tests(), which sends them to a Docker container where they are written to a file and run as a separate Python process. The import unittest happens inside the container's Python interpreter, not in the repo test runner's process.
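
To make the process separation concrete, here is a small local stand-in, assuming a plain subprocess instead of Docker. `run_in_subprocess` and the sample test string are hypothetical and are not how execute_python_code_with_tests() is implemented; they only show that the unittest import lives in the child interpreter.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Test data in BigCodeBench's unittest style; it is only a string in this process.
TEST_CODE = r"""
import unittest

class TestCases(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    unittest.main()
"""


def run_in_subprocess(code: str, timeout: float = 30.0) -> int:
    """Write the code to a temporary file and run it in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(code, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # unittest.main() exits with 0 when all tests pass, non-zero otherwise.
    return result.returncode
```

A pytest test in the repo can call a helper like this and assert on the return code without ever importing unittest itself.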

tfburns and others added 5 commits February 26, 2026 08:28

feat: add OLMES variant of BigCodeBench

d9772ec

docs: update readme and BigCodeBench_OLMES docs

4afb0af

feat: cleanup unit tests

ba979e4

fix: prompt hashes for BigCodeBench are non-deterministic

127288b

Merge branch 'main' into big_code_bench

61bb27f

tfburns marked this pull request as ready for review February 26, 2026 14:41

prabhuteja12 reviewed Feb 26, 2026

docs: improved error messaging/logic and test names and docstrings for BigCodeBench_OLMES task

4546aa3

tfburns wants to merge 6 commits into main from big_code_bench


2 participants
